Mas d34 leveled.i459 partialmerge #460

Merged
merged 2 commits into from
Nov 30, 2024

@martinsumner martinsumner commented Nov 26, 2024

To resolve #459.

The PR introduces a max_mergebelow size which can be a positive integer, or infinity. It defaults to 24. This is a startup option, but it isn't expected to be something the end-user is encouraged to tune - the configurability is primarily to aid testing.

As at present, the same algorithm (which is largely random) will pick a file at Level N to merge into Level N + 1, when Level N is overflowing and N is the closest level to the root which is overflowing. The same process is then used to discover the list of files at Level N + 1 which the chosen merge file covers. This number can be anywhere from 0 to the file count in Level N + 1.

If a merge from Level N covers fewer than max_mergebelow files in Level N + 1 - the merge will proceed as before. It is expected that this will occur > 99% of the time in the majority of implementations.

If a merge has >= max_mergebelow, the merge will be curtailed when max_mergebelow div 2 files have been created at that level i.e. the number of additions to be made to Level N + 1 is limited to this value. In some circumstances, this may merge all the files (e.g. when N + 1 is a basement and the merged file contains a large number of tombstones) - but in most cases there will be a remainder of KV pairs from the merge file at Level N, and some remaining extracted pairs at Level N + 1 as well as some whole files.
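The threshold rule described above can be sketched as follows. This is an illustrative Python model, not the leveled (Erlang) implementation: the function name `plan_merge` and its return shape are assumptions made for the example, and only the comparison against `max_mergebelow` and the `div 2` cap come from the description.

```python
from typing import Optional, Tuple, Union

INFINITY = float("inf")  # stands in for the `infinity` option value


def plan_merge(
    covered_files: int, max_mergebelow: Union[int, float] = 24
) -> Tuple[str, Optional[int]]:
    """Decide whether a merge runs in full or is curtailed.

    covered_files  -- number of Level N + 1 files covered by the chosen
                      Level N file (hypothetical input for this sketch).
    max_mergebelow -- positive integer, or INFINITY to never curtail.

    Returns ("full", None) or ("partial", cap), where cap is the number of
    files that may be created at Level N + 1 before the merge stops.
    """
    if max_mergebelow != INFINITY:
        if not isinstance(max_mergebelow, int) or max_mergebelow < 1:
            raise ValueError("max_mergebelow must be a positive integer or infinity")
    if covered_files < max_mergebelow:
        # Fewer files than the threshold: merge proceeds as before.
        return ("full", None)
    # At or above the threshold: stop once half the threshold has been written.
    return ("partial", max_mergebelow // 2)
```

With the default of 24, a merge covering 33 files would be curtailed after 12 new files, while a merge covering 4 files is unaffected.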

The remainder for Level N will then be written as a new file, as well as for Level N + 1 up to the next whole file that has not yet been touched by the merge.

The manifest change will remove the old Level N file, any Level N + 1 files it was merged into, and then add the remainder file at both Level N and Level N + 1.
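As a rough illustration of that manifest change, the sketch below models the manifest as a plain mapping from level number to a list of file names. The data structure, the function name `apply_partial_merge` and all file names are hypothetical; the real manifest is ordered by key range, which this sketch does not attempt to reproduce.

```python
from typing import Dict, List, Optional

Manifest = Dict[int, List[str]]


def apply_partial_merge(
    manifest: Manifest,
    level: int,
    source_file: str,
    merged_below: List[str],
    created_below: List[str],
    remainder_source: Optional[str] = None,
    remainder_below: Optional[str] = None,
) -> Manifest:
    """Return a new manifest reflecting a (possibly partial) merge.

    Removes the old Level N file and the Level N + 1 files it was merged
    into, adds the files created by the merge, then adds the remainder
    files at both levels when the merge was curtailed.
    """
    touched = set(merged_below)
    upper = [f for f in manifest.get(level, []) if f != source_file]
    lower = [f for f in manifest.get(level + 1, []) if f not in touched]
    lower.extend(created_below)
    if remainder_below is not None:
        lower.append(remainder_below)
    if remainder_source is not None:
        upper.append(remainder_source)
    updated = dict(manifest)  # shallow copy; the input is left unmodified
    updated[level] = upper
    updated[level + 1] = lower
    return updated
```

When the merge completes in full, both remainder arguments stay as `None` and the change reduces to the ordinary remove-and-add.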

The backlog that prompted the merge will still exist - as the files in Level N have not been changed. However, it is likely the next file picked will not be the same one, and will in all probability have a lower number of files to merge (as the average is =< 8). This is why, in this case, the merge ceases after creating only half of max_mergebelow files rather than all of them - it has to be acknowledged that another merge will still be required.

This will stop progress from being halted by long merge jobs, as they will exit out in a safe way after partial completion. In the case where the majority of files covered do not require a merge, then those files will be skipped the next time the remainder file is picked up for merge at Level N.

Nothing actually crashes due to the issue - but the logs show the polarised stats associated with it. When merging into L3, you would normally expect to merge into 4 files - but actually we see:

```
2024-11-26 13:18:26.720111+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=348 to Level=3 and FileCounter=5
2024-11-26 13:18:26.905800+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=349 to Level=3 and FileCounter=4
2024-11-26 13:18:27.919675+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=356 to Level=3 and FileCounter=1
2024-11-26 13:18:28.664078+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=361 to Level=3 and FileCounter=1
2024-11-26 13:18:29.332219+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=366 to Level=3 and FileCounter=2
2024-11-26 13:18:29.427063+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=367 to Level=3 and FileCounter=2
2024-11-26 13:18:29.856710+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=372 to Level=3 and FileCounter=1
2024-11-26 13:18:31.350539+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=377 to Level=3 and FileCounter=3
2024-11-26 13:18:31.490793+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=378 to Level=3 and FileCounter=3
2024-11-26 13:18:32.381107+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=384 to Level=3 and FileCounter=5
2024-11-26 13:18:32.950525+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=388 to Level=3 and FileCounter=4
2024-11-26 13:18:33.079716+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=389 to Level=3 and FileCounter=2
2024-11-26 13:18:33.832971+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=396 to Level=3 and FileCounter=4
2024-11-26 13:18:37.110446+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=405 to Level=3 and FileCounter=28
2024-11-26 13:18:38.361759+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=413 to Level=3 and FileCounter=4
2024-11-26 13:18:40.275745+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=414 to Level=3 and FileCounter=33
2024-11-26 13:18:40.380565+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=415 to Level=3 and FileCounter=2
2024-11-26 13:18:40.440991+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=416 to Level=3 and FileCounter=1
2024-11-26 13:18:40.684225+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=418 to Level=3 and FileCounter=3
2024-11-26 13:18:41.528004+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=424 to Level=3 and FileCounter=2
2024-11-26 13:18:42.333724+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=430 to Level=3 and FileCounter=3
2024-11-26 13:18:42.508870+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=431 to Level=3 and FileCounter=4
2024-11-26 13:18:45.869575+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=439 to Level=3 and FileCounter=35
2024-11-26 13:18:47.009045+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=447 to Level=3 and FileCounter=5
2024-11-26 13:18:47.099618+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=448 to Level=3 and FileCounter=2
2024-11-26 13:18:47.312631+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=449 to Level=3 and FileCounter=5
2024-11-26 13:18:47.404270+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=450 to Level=3 and FileCounter=2
2024-11-26 13:18:47.572444+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=451 to Level=3 and FileCounter=4
2024-11-26 13:18:47.662239+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=452 to Level=3 and FileCounter=2
2024-11-26 13:18:49.142911+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=462 to Level=3 and FileCounter=2
2024-11-26 13:18:49.292950+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=463 to Level=3 and FileCounter=2
2024-11-26 13:18:51.333187+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=464 to Level=3 and FileCounter=33
2024-11-26 13:18:52.587977+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=474 to Level=3 and FileCounter=5
2024-11-26 13:18:52.656720+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=475 to Level=3 and FileCounter=2
2024-11-26 13:18:52.828039+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=476 to Level=3 and FileCounter=4
2024-11-26 13:18:54.412641+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=487 to Level=3 and FileCounter=5
2024-11-26 13:18:55.640011+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=495 to Level=3 and FileCounter=3
2024-11-26 13:18:57.600358+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=496 to Level=3 and FileCounter=33
2024-11-26 13:18:58.739281+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=504 to Level=3 and FileCounter=2
2024-11-26 13:18:59.006247+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=505 to Level=3 and FileCounter=6
2024-11-26 13:18:59.095642+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=506 to Level=3 and FileCounter=2
2024-11-26 13:19:00.946004+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=507 to Level=3 and FileCounter=32
2024-11-26 13:19:01.237294+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=508 to Level=3 and FileCounter=5
2024-11-26 13:19:02.096930+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=516 to Level=3 and FileCounter=2
2024-11-26 13:19:02.242496+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=517 to Level=3 and FileCounter=2
2024-11-26 13:19:02.925442+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=522 to Level=3 and FileCounter=2
2024-11-26 13:19:03.009468+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=523 to Level=3 and FileCounter=1
2024-11-26 13:19:03.728324+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=528 to Level=3 and FileCounter=2
2024-11-26 13:19:05.036243+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=535 to Level=3 and FileCounter=6
2024-11-26 13:19:05.214732+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=536 to Level=3 and FileCounter=4
2024-11-26 13:19:06.216435+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=544 to Level=3 and FileCounter=3
2024-11-26 13:19:06.398908+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=545 to Level=3 and FileCounter=4
2024-11-26 13:19:06.560985+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=546 to Level=3 and FileCounter=3
2024-11-26 13:19:09.844005+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=553 to Level=3 and FileCounter=29
2024-11-26 13:19:10.510867+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=554 to Level=3 and FileCounter=10
2024-11-26 13:19:10.928794+00:00 log_level=info log_ref=pc011 db_id=65536 pid=<0.436.0> Merge completed with MSN=555 to Level=3 and FileCounter=4
```
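The spikes in the log above can be picked out mechanically. The following is a minimal sketch that relies only on the `Merge completed with MSN=... to Level=... and FileCounter=...` text visible in those lines; the function name `large_merges` and the choice of threshold are assumptions for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Matches the merge-completion message as it appears in the pc011 log lines.
MERGE_LINE = re.compile(
    r"Merge completed with MSN=(\d+) to Level=(\d+) and FileCounter=(\d+)"
)


def large_merges(lines: Iterable[str], threshold: int) -> List[Tuple[int, int, int]]:
    """Return (msn, level, file_counter) for merges at or above threshold."""
    hits = []
    for line in lines:
        match = MERGE_LINE.search(line)
        if match is None:
            continue  # not a merge-completion line
        msn, level, file_counter = (int(group) for group in match.groups())
        if file_counter >= threshold:
            hits.append((msn, level, file_counter))
    return hits
```

Run against the excerpt with a threshold of 24, this would flag the merges reporting FileCounter values of 28 and above and skip the rest.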
There is a `max_mergebelow` size which can be a positive integer, or infinity. It defaults to 32.

If a merge from Level N covers fewer than `max_mergebelow` files in Level N + 1 - the merge will proceed as before. If it has >= `max_mergebelow`, the merge will be curtailed when `max_mergebelow div 2` files have been created at that level. The remainder for Level N will then be written, as well as for Level N + 1 up to the next whole file that has not yet been touched by the merge.

The backlog that prompted the merge will still exist - as the files in Level N have not been changed. However, it is likely the next file picked will not be the same one, and will in all probability have a lower number of files to merge (as the average is =< 8).

This will stop progress from being halted by long merge jobs, as they will exit out in a safe way after partial completion. In the case where the majority of files covered do not require a merge, then those files will be skipped the next time the remainder file is picked up for merge at Level N.
@martinsumner
Owner Author

Once an individual vnode or leveled backend reaches 25bn keys in size, then merges into the basement level will increasingly become partial - which may mean multiple merges are required to clear a backlog at Level 6.

At this stage, assuming a ring-size of 512, this amounts to 4.2 trillion keys in the Riak cluster (more if there are a majority of index keys). For Riak, the answer would be to increase the ring-size.

I assume here that nothing needs to be done about this, as this is much bigger vnode size than anticipated. At this stage, there are other potential issues running at such untested scale.
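The cluster-wide figure can be reproduced with a little arithmetic. Note that 25bn keys across 512 vnodes is 12.8 trillion stored keys; the 4.2 trillion figure matches only if each key is held on three vnodes. That replication factor (an `n_val` of 3) is an assumption here, as it is not stated in the comment.

```python
def cluster_unique_keys(keys_per_vnode: int, ring_size: int, n_val: int) -> float:
    """Estimate unique keys in a cluster from per-vnode key counts.

    n_val is the number of vnodes holding each key; the value of 3 used
    below is an assumption, not something stated in the discussion.
    """
    stored_keys = keys_per_vnode * ring_size  # keys summed over all vnodes
    return stored_keys / n_val


estimate = cluster_unique_keys(25_000_000_000, 512, 3)
print(f"{estimate / 1e12:.2f} trillion unique keys")
```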

Contributor

@ThomasArts ThomasArts left a comment

Elegant

@martinsumner
Owner Author

perfSUITEv9-pr460

The comparison line is an old test of this branch on OTP 24 - and it compares the last 26.2.5 test pre eqwalizer, post eqwalizer and this PR (which includes eqwalizer changes).

The update/load stability and improvements we expect are there. There is also a mysterious improvement in the GET time, and a mysterious degradation in the query time, which is mainly in smaller volume test runs.

These things I can't explain, unless there is some secondary impact associated with page caches. i.e. The big merges primarily impacted SST files with index entries, and promoted them in the cache - now the SST files covering object entries are more likely to be in the page cache - so GET/HEAD improve at the cost of QUERY?

@martinsumner martinsumner merged commit 69e8b29 into develop-3.4 Nov 30, 2024
2 checks passed
@martinsumner martinsumner deleted the mas-d34-leveled.i459-partialmerge branch November 30, 2024 13:16
martinsumner added a commit that referenced this pull request Nov 30, 2024
* Add test to replicate issue 459

Nothing actually crashes due to the issue - but the logs show the polarised stats associated with it. When merging into L3, you would normally expect to merge into 4 files - but actually we see FileCounter occasionally spiking.

* Add partial merge support

martinsumner added a commit that referenced this pull request Nov 30, 2024
* Mas d34 leveled.i459 partialmerge (#460)

* Add test to replicate issue 459

* Add partial merge support

* Optimise test

Test made faster through backporting testutil:get_compressiblevalue/0 from develop-3.4.

Also use lower max_sstslots to invoke condition with fewer keys, and reduce test time.

Successfully merging this pull request may close these issues.

Need for partial merge
2 participants